---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---

<Callout>
  A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.
</Callout>

The HUD integration lets you benchmark an agent with the [HUD framework](https://www.hud.so/). The agent controls a computer inside a HUD environment, and HUD runs tests to determine whether each task succeeded.

## Installation

First, install the required package:

```bash
pip install "cua-agent[hud]"
# or install hud-python directly
# pip install hud-python==0.4.12
```

## Environment Variables

Before running any evaluations, you’ll need to set up your environment variables for HUD and your model providers:

```bash
# HUD access
export HUD_API_KEY="your_hud_api_key"

# Model provider keys (at least one required)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
```
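
Missing keys tend to surface only after a run has started, so it can help to check them up front. The sketch below is illustrative and not part of `cua-agent` or HUD: `check_env` is a hypothetical helper, and it assumes only what is stated above, namely that `HUD_API_KEY` is needed along with at least one provider key.

```python
import os
from typing import List, Mapping

HUD_KEY = "HUD_API_KEY"
PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def check_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Hypothetical helper: return a list of problems (empty if the env looks usable)."""
    problems = []
    # HUD access is always required.
    if not env.get(HUD_KEY):
        problems.append(f"{HUD_KEY} is not set")
    # At least one model provider key must be present and non-empty.
    if not any(env.get(key) for key in PROVIDER_KEYS):
        problems.append("set at least one of: " + ", ".join(PROVIDER_KEYS))
    return problems
```

Calling `check_env()` with no arguments inspects the current process environment; passing a plain dict makes the check easy to unit test.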

## Running a Single Task

You can run a single task from a HUD dataset for quick verification.

### Example

```python
from agent.integrations.hud import run_single_task

await run_single_task(
    dataset="hud-evals/OSWorld-Verified",   # or another HUD dataset
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    task_id=155,  # e.g., reopen last closed tab
)
```

### Parameters

- `task_id` (`int`): Default: `0`
  Index of the task to run from the dataset.
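
The example above uses a bare `await`, which works in a notebook but not in a plain Python script, where the call has to run inside an event loop. The sketch below shows one way to spot-check a few task indices sequentially from a script. `run_tasks_in_order` is a hypothetical helper, not part of the integration; it takes the runner as an argument and simply collects whatever the runner returns, since the return value of `run_single_task` is not documented here.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Sequence


async def run_tasks_in_order(
    run_task: Callable[..., Awaitable[Any]],
    task_ids: Sequence[int],
    **shared: Any,
) -> Dict[int, Any]:
    """Hypothetical helper: await run_task once per task index, one at a time."""
    results: Dict[int, Any] = {}
    for task_id in task_ids:
        # The same shared options (dataset, model, ...) are passed to every call.
        results[task_id] = await run_task(task_id=task_id, **shared)
    return results


# Usage from a script (requires cua-agent[hud] and the API keys above):
#
#   from agent.integrations.hud import run_single_task
#
#   asyncio.run(run_tasks_in_order(
#       run_single_task,
#       [155, 0],
#       dataset="hud-evals/OSWorld-Verified",
#       model="openai/computer-use-preview",
#   ))
```

Running tasks one at a time keeps logs readable while debugging; for parallel runs, use the full-dataset runner described next.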

## Running a Full Dataset

To benchmark your agent at scale, you can run an entire dataset (or a subset) in parallel.

### Example

```python
from agent.integrations.hud import run_full_dataset

results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",   # can also pass a Dataset or list[dict]
    model="openai/computer-use-preview",
    split="train[:3]",           # try a few tasks to start
    max_concurrent=20,            # tune to your infra
    max_steps=50                  # safety cap per task
)
```

### Parameters

- `job_name` (`str` | `None`):
  Optional human-readable name for the evaluation job (shows up in HUD UI).
- `max_concurrent` (`int`): Default: `30`
  Number of tasks to run in parallel. Scale this based on your infra.
- `max_steps` (`int`): Default: `50`
  Safety cap on steps per task to prevent infinite loops.
- `split` (`str`): Default: `"train"`
  Dataset split or subset to run. Uses the [Hugging Face split format](https://huggingface.co/docs/datasets/v1.11.0/splits.html), e.g., `"train[:10]"` for the first 10 tasks.
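
Because `split` takes a Hugging Face slice string, a large dataset can be worked through in smaller batches by generating consecutive slices. The helpers below are a minimal sketch with hypothetical names (`split_slice`, `batched_splits`); they only build strings and do not call HUD or `datasets`. The `total` argument is something you supply yourself.

```python
from typing import List, Optional


def split_slice(split: str = "train", start: Optional[int] = None,
                stop: Optional[int] = None) -> str:
    """Hypothetical helper: build a split string such as 'train[:10]' or 'train[10:20]'."""
    if start is None and stop is None:
        return split  # the whole split, e.g. "train"
    left = "" if start is None else str(start)
    right = "" if stop is None else str(stop)
    return f"{split}[{left}:{right}]"


def batched_splits(split: str, total: int, batch_size: int) -> List[str]:
    """Hypothetical helper: consecutive slices that together cover `total` tasks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        split_slice(split, lo, min(lo + batch_size, total))
        for lo in range(0, total, batch_size)
    ]
```

Each returned string can be passed as `split=` to a separate `run_full_dataset` call, which makes it easier to resume after a failed batch.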

## Additional Parameters

Both single-task and full-dataset runs share a common set of configuration options. These let you fine-tune how the evaluation runs.

- `dataset` (`str` | `Dataset` | `list[dict]`): **Required**
  HUD dataset name (e.g. `"hud-evals/OSWorld-Verified"`), a loaded `Dataset`, or a list of tasks.
- `model` (`str`): Default: `"computer-use-preview"`
  Model string, e.g. `"openai/computer-use-preview+openai/gpt-5-nano"`. Supports composition with `+` (planning + grounding).
- `allowed_tools` (`list[str]`): Default: `["openai_computer"]`
  Restrict which tools the agent may use.
- `tools` (`list[Any]`):
  Extra tool configs to inject.
- `custom_loop` (`Callable`):
  Optional custom agent loop function. If provided, overrides automatic loop selection.
- `only_n_most_recent_images` (`int`): Default: `5` for full dataset, `None` for single task.
  Retain only the last N screenshots in memory.
- `callbacks` (`list[Any]`):
  Hook functions for logging, telemetry, or side effects.
- `verbosity` (`int`):
  Logging level. Set `2` for debugging every call/action.
- `trajectory_dir` (`str` | `dict`):
  Save local copies of trajectories for replay/analysis.
- `max_retries` (`int`): Default: `3`
  Number of retries for failed model/tool calls.
- `screenshot_delay` (`float` | `int`): Default: `0.5`
  Delay (seconds) between screenshots to avoid race conditions.
- `use_prompt_caching` (`bool`): Default: `False`
  Cache repeated prompts to reduce API calls.
- `max_trajectory_budget` (`float` | `dict`):
  Limit on trajectory size/budget (e.g., tokens, steps).
- `telemetry_enabled` (`bool`): Default: `True`
  Whether to send telemetry/traces to HUD.
- `**kwargs` (`Any`):
  Any additional keyword arguments are passed through to the agent loop or model provider.
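
Since these options are shared, one way to keep a quick single-task check and a full run consistent is to define them once and unpack them into each call. The sketch below uses only parameter names listed in this page; the specific values are illustrative, not recommendations.

```python
# Options accepted by both runners (values here are illustrative).
COMMON_OPTIONS = {
    "dataset": "hud-evals/OSWorld-Verified",
    "model": "openai/computer-use-preview",
    "verbosity": 2,              # log every call/action while debugging
    "max_retries": 3,
    "screenshot_delay": 0.5,     # seconds
    "only_n_most_recent_images": 5,
}

# Runner-specific options are layered on top of the shared ones.
single_task_kwargs = {**COMMON_OPTIONS, "task_id": 155}
full_dataset_kwargs = {
    **COMMON_OPTIONS,
    "split": "train[:3]",
    "max_concurrent": 20,
    "max_steps": 50,
}

# With cua-agent[hud] installed, inside an async context:
#   await run_single_task(**single_task_kwargs)
#   results = await run_full_dataset(**full_dataset_kwargs)
```

Keeping the shared options in one dict means a change to the model or dataset applies to both kinds of run.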

## Available Benchmarks

HUD provides multiple benchmark datasets for realistic evaluation.

1. **[OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified)** – Benchmark on 369+ real-world desktop tasks across Chrome, LibreOffice, GIMP, VS Code, etc.
   _Best for_: evaluating full computer-use agents in realistic environments.
   _Verified variant_: fixes 300+ issues from earlier versions for reliability.

**Coming soon:** SheetBench (spreadsheet automation) and other specialized HUD datasets.

See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.

## Tips

- **Debugging:** set `verbosity=2` to see every model call and tool action.
- **Performance:** lower `screenshot_delay` for faster runs; raise it if you see race conditions.
- **Safety:** keep `max_steps` at a finite value (default `50`) to prevent runaway loops.
- **Custom tools:** pass extra `tools=[...]` into the agent config if you need tools beyond `openai_computer`.
